BASIC CONCEPT OF ALGORITHMIC APPROACH



Our method assumes a Poisson distribution for daily counts from each of the different data 
sources. We modeled the daily mean as a baseline plus an outbreak component. The main 
contribution of our approach is to explicitly model the outbreak component. Guided by 
the distinct outbreak signatures in the training data, we assumed parametric forms for the 
outbreak profiles. This allows the development of a set of mixture likelihood ratio statistics, 
one for each possible outbreak starting date. The Mixture Likelihood Ratio Scan Statistic 
(MLRSS) is the result of scanning over the possible starting dates to find the most likely 
one. 

This approach, based on standard sequential change point methodologies, enjoys several 
properties. By taking into account the outbreak profiles, we are leveraging more information 
about the total process (both before and after an outbreak) than if we only considered 
nonspecific deviations from the "in-control" process. Furthermore, although not an objective 
of the contest, our method facilitates the estimation and prediction of the outbreak start time, 
severity, and length. This could be useful in planning and evaluating mitigation strategies 
once an outbreak is detected. Our method can be extended to multiple data sources, space-time
surveillance, or multiple syndromes. Finally, the computation is quick enough to allow
this method to be used when the data arrives much more frequently than once per day. 

Description of model 

For each data source, we assume $o_t$, the daily counts on day $t$, are generated independently
as

$$o_t \sim \mathrm{Pois}\big(\lambda_t + \delta_t(t_0, \theta)\big)$$

where $\lambda_t$ is the baseline mean for time $t$ and $\delta_t(t_0, \theta)$ is the mean excess due to an
outbreak. The baseline mean has the form $\lambda_t = \exp(X_t'\beta)$, where the vector $X_t$ contains
terms for day of week and seasonal effects and the vector $\beta$ contains parameters that are
estimated from the training data. If there is no outbreak, $\delta_t(t_0, \theta)$ is equal to zero and we
denote the density of $o_t$ as $f_0(o_t)$. If there is an outbreak, the profile, $\delta_t(t_0, \theta)$, is a
function of the start time of the outbreak, $t_0$, and shape parameters, $\theta$. The density of $o_t$
under an outbreak is denoted $f_1(o_t; \delta_t(t_0, \theta))$.
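
As a minimal sketch of the two densities just described, the following evaluates the Poisson log-density of a single daily count with and without an outbreak excess. The function names and the numeric values (baseline 20, excess 5, count 30) are hypothetical and chosen only for illustration.

```python
import math

def poisson_logpmf(count, mean):
    """Log of the Poisson probability mass function evaluated at `count`."""
    return count * math.log(mean) - mean - math.lgamma(count + 1)

def daily_mean(baseline, excess=0.0):
    """Mean of the daily count: baseline plus outbreak excess (zero if no outbreak)."""
    return baseline + excess

# Hypothetical single day: baseline mean 20, outbreak excess 5, observed count 30.
log_f0 = poisson_logpmf(30, daily_mean(20.0))        # no-outbreak density
log_f1 = poisson_logpmf(30, daily_mean(20.0, 5.0))   # outbreak density
```

A count well above the baseline is better explained by the outbreak density, so `log_f1` exceeds `log_f0` here.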

Outline of surveillance methodology 

We found evidence in the training data of distinct outbreak signatures for each data source 
during an outbreak. To incorporate this information into our surveillance statistics we 
adopted a likelihood ratio based approach. 

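Let $\Lambda_{t_0}^{t}(\theta)$ be the likelihood ratio (LR) for an outbreak at time $t$ which started at time
$t_0$ versus no outbreak. Under our assumptions, this LR is

$$\Lambda_{t_0}^{t}(\theta) = \prod_{s=t_0}^{t} \frac{f_1(o_s; \delta_s(t_0, \theta))}{f_0(o_s)} = \prod_{s=t_0}^{t} e^{-\delta_s(t_0, \theta)} \left(1 + \frac{\delta_s(t_0, \theta)}{\lambda_s}\right)^{o_s} \qquad (1)$$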
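
The Poisson likelihood ratio is easiest to evaluate on the log scale, where the product over days becomes a sum. The sketch below assumes the counts, baseline means, and outbreak excesses for days $t_0, \ldots, t$ are already available as equal-length sequences; the function name is hypothetical.

```python
import math

def log_likelihood_ratio(counts, baseline, excess):
    """Log LR of outbreak vs. no outbreak over days t0..t.

    Each argument is a sequence covering days t0..t: observed counts,
    baseline means, and outbreak excess for one candidate profile.
    Each day contributes -excess + count * log(1 + excess / baseline).
    """
    return sum(
        -d + o * math.log1p(d / lam)
        for o, lam, d in zip(counts, baseline, excess)
    )
```

With zero excess on every day the log LR is zero, which corresponds to an LR of one.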

While we assume knowledge of the parametric form of the outbreak profiles, we do not
know the exact parameter values for a given outbreak. One approach for dealing with this
uncertainty is to integrate the likelihood ratio with respect to some probability density $h$ of
$\theta$. This approach, termed the Mixture Likelihood Ratio (MLR) ([1], [2]), creates a new statistic
defined in (2). If $h$ is a discrete uniform distribution with mass at $\{\theta_1, \theta_2, \ldots, \theta_{n_\theta}\}$, the
MLR simplifies to the expression on the right side of (2):

$$S_{t_0,t} = \int \Lambda_{t_0}^{t}(\theta)\, h(\theta)\, d\theta = \frac{1}{n_\theta} \sum_{j=1}^{n_\theta} \Lambda_{t_0}^{t}(\theta_j) \qquad (2)$$
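
Under the discrete uniform mixing density, the MLR is the plain average of the likelihood ratios over the candidate profiles. A minimal sketch follows, assuming each candidate profile has been evaluated to a sequence of excess means for days $t_0, \ldots, t$. The log-sum-exp step is a numerical-stability choice made for this example, and all names are hypothetical.

```python
import math

def log_lr(counts, baseline, excess):
    """Log Poisson LR over days t0..t for one candidate profile."""
    return sum(-d + o * math.log1p(d / lam)
               for o, lam, d in zip(counts, baseline, excess))

def log_mixture_lr(counts, baseline, candidate_excesses):
    """Log of the average LR over candidate profiles (uniform weights).

    `candidate_excesses` holds one excess sequence per candidate parameter value.
    """
    logs = [log_lr(counts, baseline, ex) for ex in candidate_excesses]
    top = max(logs)
    # Subtract the maximum before exponentiating to avoid overflow.
    return top + math.log(sum(math.exp(v - top) for v in logs) / len(logs))
```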

For each time $t$, $S_{t_0,t}$ is calculated for every outbreak starting time $t_0$ in a window $W_t$.
The window is set to limit the amount of past data considered for the starting time of
the outbreak. By scanning over the possible outbreak start times, we obtain the Mixture
Likelihood Ratio Scan Statistic (MLRSS)

$$R_t = \max_{t_0 \in W_t} S_{t_0,t} \qquad (3)$$
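
The scan itself is a maximization over candidate start days. The sketch below takes the mixture statistic as a callable and returns both the maximum and the start day that attains it. The toy stand-in function and all names are hypothetical.

```python
def scan_statistic(mixture_lr, t, window):
    """Return (R_t, t0_hat): the largest mixture LR over start days in `window`,
    and the start day that attains it."""
    values = {t0: mixture_lr(t0, t) for t0 in window}
    t0_hat = max(values, key=values.get)
    return values[t0_hat], t0_hat

# Toy stand-in for the mixture statistic: peaks when the candidate start day is 42.
def toy_mixture_lr(t0, t):
    return 10.0 - abs(t0 - 42)

r_t, t0_hat = scan_statistic(toy_mixture_lr, t=50, window=range(40, 50))
```

Returning the maximizing start day alongside the statistic gives the estimate of the outbreak start time as a by-product of detection.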

The MLRSS provides evidence that an outbreak has begun sometime prior to the current
time $t$. Therefore, this statistic will continue to take a large value, even after the outbreak
period, as long as $W_t$ still contains the outbreak start time. However, the Technical Contest
evaluates an algorithm score that assesses evidence that an outbreak is occurring at a
particular time. To get an appropriate algorithm score, we took the least squares slope estimate
of $\{R_s : s = t - S, \ldots, t\}$ over the last $S + 1$ days. Thus our algorithm score, $a_t$, took the
form of a weighted sum, where $w_s$ are the weights that yield the slope estimate:

$$a_t = \sum_{s=t-S}^{t} w_s R_s \qquad (4)$$

This algorithm score is large when R t is increasing dramatically and around zero when R t 
is essentially constant. 
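
A least squares slope can always be written as a weighted sum of the responses, with weights proportional to the centred time index. The following sketch computes those weights and the resulting score from the most recent values of the scan statistic. The function names are hypothetical, and the time index is taken as 0, 1, 2, ..., which gives the same slope as any other equally spaced index.

```python
def slope_weights(n_days):
    """Weights w such that sum(w[i] * y[i]) is the least squares slope
    of y regressed on the equally spaced index 0..n_days-1."""
    xs = range(n_days)
    x_bar = sum(xs) / n_days
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return [(x - x_bar) / sxx for x in xs]

def algorithm_score(recent_r):
    """Slope of the most recent scan statistic values, oldest first."""
    weights = slope_weights(len(recent_r))
    return sum(w * r for w, r in zip(weights, recent_r))
```

For example, `algorithm_score([1, 3, 5, 7])` recovers the slope of 2 per day, while a constant sequence scores zero.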

ADAPTATIONS FOR THE CONTEST 

The characteristics of each day were described by variables in the vector $X_t$, which was
composed of a weekday/weekend indicator, sine and cosine functions of time with 1, 2, 4, 8,
and 16 periods per year, and the interactions between the weekday/weekend indicator and
the sine/cosine functions with 1 and 2 periods per year. The $\beta$ coefficients were estimated
from the training baseline data and held fixed in the subsequent analysis of the testing data.
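
The covariate vector described above can be sketched as follows. The intercept term, the 365.25-day year, the ordering of the entries, and the function names are assumptions made for this example and are not stated in the text.

```python
import math

HARMONICS = (1, 2, 4, 8, 16)     # periods per year, as listed in the text
INTERACTION_HARMONICS = (1, 2)   # harmonics interacted with the weekend indicator

def design_vector(day_of_year, is_weekend, days_per_year=365.25):
    """Covariates for one day: intercept (assumed), weekend indicator,
    sine/cosine harmonics, and weekend-by-harmonic interactions."""
    w = float(is_weekend)
    angle = 2 * math.pi * day_of_year / days_per_year
    x = [1.0, w]
    for k in HARMONICS:
        x += [math.sin(k * angle), math.cos(k * angle)]
    for k in INTERACTION_HARMONICS:
        x += [w * math.sin(k * angle), w * math.cos(k * angle)]
    return x

def baseline_mean(x, beta):
    """Baseline mean as the exponential of the linear predictor."""
    return math.exp(sum(xi * bi for xi, bi in zip(x, beta)))
```

With these choices the vector has 16 entries: one intercept, one indicator, ten harmonic terms, and four interaction terms.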

The most important adaptation of our approach is determining the parametric form of
the outbreak profiles, $\delta_t(t_0, \theta)$. For each data source, we extracted the 30 outbreak signatures
by subtracting the common baseline count from the daily counts. Visual inspection of these
outbreak signatures led us to assume three parametric outbreak profiles. Figure 1 shows the
outbreak profile estimated for the first training outbreak of each data source.

Figure 1: Outbreak profiles (lines) estimated for the first training outbreak (points) of each
data source. The x-axis is days into outbreak and the y-axis is the excess counts.



The mathematical form of these outbreak profiles is shown in Table 1. For the ED and
OTC data sources we used a log-normal and Gaussian kernel, respectively, with $\theta = (c, \mu, \sigma)$.
The TH outbreak signatures were more complicated, often bimodal. Therefore we used a
two-component Gaussian mixture with $\theta = (c, \mu_1, \mu_2, \sigma)$. For all of these data types $c$ affects
the severity of the outbreak, $\mu$ affects the peak day (or days) of the outbreak, and $\sigma$ the
duration of the outbreak.
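
One plausible parameterization of the three profile families is sketched below, purely for illustration. It assumes each profile is a scaled density in days since onset and that the two Gaussian components share equal weight. Neither assumption is confirmed by the text, the exact expressions are those given in Table 1, and all function names are hypothetical.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def lognormal_profile(day, c, mu, sigma):
    """Assumed ED-style profile: c times a log-normal density in days since onset."""
    if day <= 0:
        return 0.0
    return c * normal_pdf(math.log(day), mu, sigma) / day

def gaussian_profile(day, c, mu, sigma):
    """Assumed OTC-style profile: c times a Gaussian density centred on day mu."""
    return c * normal_pdf(day, mu, sigma)

def two_gaussian_profile(day, c, mu1, mu2, sigma, weight=0.5):
    """Assumed TH-style profile: mixture of two Gaussian bumps (weight is hypothetical)."""
    return c * (weight * normal_pdf(day, mu1, sigma)
                + (1 - weight) * normal_pdf(day, mu2, sigma))
```

In each family the scale, location, and spread parameters play the severity, peak-day, and duration roles described above.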

In addition to using the training data to determine the parametric form of the outbreak
curves, we also used the 30 training outbreak signatures to calculate values for the outbreak
parameters $\theta$. For the $j$th ($j = 1, \ldots, 30$) training outbreak signature, we estimated $\theta$ using
maximum likelihood, giving $n_\theta = 30$ potential curves indexed by $\hat\theta_1, \ldots, \hat\theta_{30}$ for each data
source. This provided a uniform discrete distribution over the possible values of $\theta$ which
were used in (2) by plugging in $\theta_j = \hat\theta_j$.

Table 1: Mathematical form of the outbreak profiles.

Equation (3) required a window $W_t$ to reduce unnecessary computation [3]. We used an
adaptive window which depends on the most likely outbreak starting time estimated from
the previous day:

$$W_t = \{\min(t^*, t - 10), \ldots, t - 1\}$$

where $t^*$ is the value of $t_0$ maximizing $S_{t_0,t-1}$. This window always includes at least 10 days,
but will extend if the current estimate of the start time is further in the past.
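
The adaptive window translates directly into a range of candidate start days. In this sketch the function and argument names are hypothetical; `prev_best_start` stands for the start day that maximized the mixture statistic on the previous day.

```python
def adaptive_window(t, prev_best_start, min_days=10):
    """Candidate start days from min(prev_best_start, t - min_days) through t - 1."""
    first = min(prev_best_start, t - min_days)
    return range(first, t)
```

For instance, with `t = 50` a previous best start of 45 yields days 40 through 49, while a previous best start of 30 stretches the window back to day 30.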

In the algorithm scores for ED, OTC, and TH, we used $S = 7$, $S = 12$, and $S = 10$,
respectively, in (4). The weights are the usual least squares slope weights,

$$w_s = \frac{x_s - \bar{x}}{\sum_{i=t-S}^{t} (x_i - \bar{x})^2}$$

where $x_i = i$ and $\bar{x} = \left(\sum_{i=t-S}^{t} x_i\right) / (S + 1)$.



IMPLEMENTATION DETAILS 



For OTC and TH data, we found that this procedure outperformed several standard
methods. However, for ED data our approach gave similar results to the commonly used and
computationally convenient exponentially weighted moving average (EWMA) approach.
Therefore, for ED data we used the EWMA algorithm score $a_t = (1 - \phi)\, a_{t-1} + \phi\, r_t$, where $a_0$ is
the initial score and $r_t = \max(o_t - \lambda_t, 0) / \sqrt{\lambda_t}$ is the truncated standardized residual. There
were no attempts to address outliers for this data source and we used $\phi = 0.25$.
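
The EWMA recursion can be sketched in a few lines. The starting value of zero is an assumption made for this example, since the initial value is not recoverable from the text, and the function name is hypothetical.

```python
import math

def ewma_scores(counts, baseline, phi=0.25, initial=0.0):
    """EWMA of truncated standardized residuals, one score per day.

    `initial` is the score before the first day (assumed zero here).
    """
    scores = []
    prev = initial
    for o, lam in zip(counts, baseline):
        r = max(o - lam, 0.0) / math.sqrt(lam)   # truncated standardized residual
        prev = (1 - phi) * prev + phi * r
        scores.append(prev)
    return scores
```

Truncating the residual at zero means that counts below the baseline never pull the score down faster than the geometric decay does.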

For the OTC and TH data, we used an ad hoc outlier remediation step in the calculation
of the mixture likelihood ratios in (2). We looked for large deviations in the standardized
residuals from a fitted outbreak profile. If

$$\max_{s} \frac{o_s - \big(\lambda_s + \delta_s(t_0, \theta_j)\big)}{\sqrt{\lambda_s + \delta_s(t_0, \theta_j)}} > \gamma$$

for $\gamma = 23$, we changed $o_{s^*}$ to $\lambda_{s^*} + \delta_{s^*}(t_0, \theta_j)$, where $s^*$ is the time with the largest residual.
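
A minimal sketch of this remediation step is given below. It assumes the residuals are standardized by the square root of the fitted Poisson mean, which is an assumption of this example, and the function name is hypothetical. `fitted_mean` holds the baseline plus the fitted outbreak excess for each day.

```python
import math

def remediate_largest_outlier(counts, fitted_mean, gamma=23.0):
    """Return a copy of `counts` in which the day with the largest standardized
    residual is replaced by its fitted mean, if that residual exceeds `gamma`."""
    residuals = [(o - m) / math.sqrt(m) for o, m in zip(counts, fitted_mean)]
    worst = max(range(len(counts)), key=residuals.__getitem__)
    cleaned = list(counts)
    if residuals[worst] > gamma:
        cleaned[worst] = fitted_mean[worst]
    return cleaned
```

Only the single worst day is replaced per call, matching the description of changing the count at the time with the largest residual.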



LESSONS LEARNED AND FUTURE RESEARCH DIRECTIONS 

Methods that attempt to model the outbreak profile directly are sensitive to the information
available concerning the profile's form. Therefore in real-life situations these methods may
be more appropriate for influenza and E. coli outbreaks than for anthrax.
Epidemic modelers often build susceptible-infected-recovered (SIR) models for disease
outbreaks that are based on infection and recovery rates. We are currently working on
incorporating these types of models into our method.

In real-world applications of this methodology, assuming a known baseline mean will
probably lead to inaccurate surveillance. Performing periodic checks of the accuracy of the
baseline and incorporating uncertainty into the baseline parameters could improve overall
detection performance.

CRITIQUE OF CONTEST METHODOLOGY AND SUGGESTIONS FOR FUTURE CONTESTS

Our first suggestion for future contests is to require the contestants to learn the baseline in 
the testing data. Our experience suggests the parameters used to generate the training data 
baseline were the same or very similar to those used for the testing data baseline. A more 
realistic situation would require the contestants to simultaneously estimate the baseline and 
find outbreaks. 

Our second suggestion is to modify the scoring to be a real-world cost function. The cost
function should incorporate costs due to false alarms as well as costs due to the detection
delay of a true outbreak. In particular, there should be a difference in early versus late
detection of the OTC and TH data. The scoring system for the ED data implies
a cost function which is linear in detection delay. We suggest this relationship is not linear
and that it may even depend on the outbreak severity.

Overall we were very pleased with this contest and enjoyed ourselves. Thank you for the 
opportunity to participate. 